Text Analysis of Biden and Trump Speeches During the 2020 Presidential Election

Introduction

The United States presidential election is one of the most closely followed political events in the world. As such, many people study the data involved, both to make predictions and to inform the public about the current state of the race. In this blog post, we analyze data from a key component of the election process: speeches given by the candidates. In particular, we analyze text from speeches given by Joe Biden and Donald Trump during the lead-up to the 2020 election. The primary questions we hoped to answer were:

  1. What are the most common words and phrases used by Trump and Biden?
  2. What are the relationships between those words/phrases?
  3. How did the frequency of these words/phrases change over time?

Data

Visualizations

To address these three questions, we created three types of visualizations, one for each question. To identify the most frequent words used in the candidates' speeches, we created word clouds in which font size corresponds to word frequency. To identify relationships between the words, we created network graphs in which edge size corresponds to the “closeness” of the words within the documents. (We define “closeness” in the network section.) Lastly, we created line graphs to identify changes in word frequencies over time.
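The counts behind the word clouds and line graphs can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the analysis: the function names `word_counts` and `counts_by_month`, the tiny `STOPWORDS` set, and the assumption that each speech arrives as an (ISO date, text) pair are all hypothetical.

```python
from collections import Counter
import re

# Illustrative stop-word list only; a real analysis would use a fuller one.
STOPWORDS = {"the", "and", "a", "to", "of", "in", "is", "we"}


def word_counts(text):
    """Count non-stop-word tokens in one speech (lowercased)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def counts_by_month(speeches, word):
    """Sum occurrences of `word` per month.

    `speeches` is assumed to be an iterable of (iso_date, text) pairs,
    where iso_date looks like "2020-09-15".
    """
    monthly = Counter()
    for iso_date, text in speeches:
        month = iso_date[:7]  # "YYYY-MM"
        monthly[month] += word_counts(text)[word]
    return dict(monthly)
```

The output of `word_counts` could set the font sizes in a word cloud, while `counts_by_month` produces the kind of series a line graph would plot.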

Word Frequency Word Clouds

Network Visualizations

To understand the relationships between the words used in the speeches, we looked at two sets of words: the most common words across all speeches, and popular election topics such as climate change, health care, and COVID-19. For each of these sets, we needed to define a metric for “closeness”. To do this, we emulated an analysis of Game of Thrones. Specifically, we defined the closeness of two words (within the full dataset) as the number of times that the words occur within d words of each other in a single speech, where d is a parameter specifying the word distance. For the first set (the most common words across all speeches), we chose d to be 10, a relatively low value: these words are more generic than the election topics, so a lower d is more likely to pick up only the significant relationships. For the second set (the election topics), on the other hand, we chose d to be 50.
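The closeness metric described above can be sketched as follows. This is one reading of the definition, not necessarily the exact implementation behind the graphs: it assumes that every pair of occurrences at most d tokens apart counts once, that the two words are distinct single tokens, and that a simple regular expression is enough for tokenization. The names `tokenize` and `closeness` are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def closeness(speeches, word_a, word_b, d):
    """Count occurrence pairs of word_a and word_b at most d tokens apart.

    Pairs are only formed within a single speech; the per-speech counts
    are then summed over the whole collection.
    """
    total = 0
    for text in speeches:
        tokens = tokenize(text)
        pos_a = [i for i, t in enumerate(tokens) if t == word_a]
        pos_b = [i for i, t in enumerate(tokens) if t == word_b]
        for i in pos_a:
            total += sum(1 for j in pos_b if abs(i - j) <= d)
    return total
```

Under this reading, a larger d can only increase the count, which is why the choice of d (10 versus 50) controls how loosely related two words may be and still share an edge. Multi-word topics such as "climate change" would need to be merged into a single token before counting.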

Text Mining

Talk about Python here (and include non-runnable chunk)

Network Analysis for Most Commonly Used Words

Speech Analysis Over Time

Limitations, Pitfalls, and Future Research